{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5bce0c8b8821b006",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "# Baby EmoLLM"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "894e9316",
   "metadata": {},
   "source": [
    "- 本教程分为两个部分：模型微调和RAG\n",
    "- 这里只做简单的流程演示，让大家明白模型微调和RAG基本原理\n",
    "- 本项目使用的微调框架为XTuner"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e688ce4",
   "metadata": {},
   "source": [
    "## 模型微调"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "051e0571",
   "metadata": {},
   "source": [
    "### 环境安装"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b9a03da",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ! pip install transformers\n",
    "# ! pip install torch\n",
    "# ! pip install datasets\n",
    "# ! pip install peft"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e871bbbde40410ac",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### 导入相关包"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "569bb07e89714c78",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-11T12:19:35.361923Z",
     "start_time": "2024-04-11T12:19:27.648057Z"
    },
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer\n",
    "from datasets import load_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fa86a98",
   "metadata": {},
   "source": [
    "### 模型下载"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e75bd70",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-08T13:29:50.286054Z",
     "start_time": "2024-04-08T13:20:20.776157Z"
    }
   },
   "outputs": [],
   "source": [
    "# from modelscope.hub.snapshot_download import snapshot_download\n",
    "# snapshot_download(model_id=\"Shanghai_AI_Laboratory/internlm2-chat-1_8b\", cache_dir=\"./models\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb967007",
   "metadata": {},
   "source": [
    "### 加载数据集"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "455d51dd6b964c8c",
   "metadata": {},
   "source": [
    "- 本项目的所有数据集存储在`./datasets`文件夹下\n",
    "- 数据集的介绍请参考`./datasets/README.md`\n",
    "- 微调数据集的构建指南请参考`./generate_data/tutorial.md`\n",
    "- 本实验的微调数据集为`./datasets/processed_single_turn_dataset_1.json`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f81f6cd2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['conversation'],\n",
       "    num_rows: 14041\n",
       "})"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ds= load_dataset(\"json\",  data_files=\"./datasets/processed_single_turn_dataset_1.json\", split=\"train\")\n",
    "ds"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d41d0a2",
   "metadata": {},
   "source": [
    "### 加载模型"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7283eaa040ff8be",
   "metadata": {},
   "source": [
    "- 本实验的模型采用的模型为`internlm2-chat-1_8b`\n",
    "- 你也可以将模型更换为其他模型\n",
    "- 微调策略采用的是`QLoRA`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ad32bc51f11a16d6",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "57a37656a238455c824c90e8a605f875",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "model_dir = \"./Shanghai_AI_Laboratory/internlm2-chat-1_8b\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)\n",
    "model = AutoModelForCausalLM.from_pretrained(model_dir, low_cpu_mem_usage=True, \n",
    "                                            torch_dtype=torch.bfloat16,\n",
    "                                            device_map=\"auto\",\n",
    "                                            load_in_4bit=True,\n",
    "                                            bnb_4bit_compute_dtype=torch.bfloat16,\n",
    "                                            bnb_4bit_quant_type=\"nf4\",\n",
    "                                            bnb_4bit_use_double_quant=True,\n",
    "                                            trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa4d9e7c",
   "metadata": {},
   "source": [
    "### 数据预处理"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9d1e0c5d2c3f007",
   "metadata": {},
   "source": [
    "- 构造数据处理函数，将数据集中的对话转换为模型的输入格式\n",
    "- 当我们使用Xtuner进行微调时，数据处理会自动进行，无需手动处理\n",
    "- 我们只需要提供处理好的对话数据即可"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e081c211",
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_func(example):\n",
    "    MAX_LENGTH = 512\n",
    "    input_ids, attention_mask, labels = [], [], []\n",
    "    instruction = example[\"conversation\"][0][\"input\"]\n",
    "    instruction = model.build_inputs(tokenizer, instruction, history=[], meta_instruction=example[\"conversation\"][0][\"system\"])  \n",
    "    response = tokenizer(example[\"conversation\"][0][\"output\"], add_special_tokens=False)       \n",
    "\n",
    "    \n",
    "    input_ids = instruction[\"input_ids\"][0].numpy().tolist() + response[\"input_ids\"] + [tokenizer.eos_token_id]\n",
    "    attention_mask = instruction[\"attention_mask\"][0].numpy().tolist() + response[\"attention_mask\"] + [1]\n",
    "    labels = [-100] * len(instruction[\"input_ids\"][0].numpy().tolist()) + response[\"input_ids\"] + [tokenizer.eos_token_id]\n",
    "\n",
    "\n",
    "    if len(input_ids) > MAX_LENGTH:\n",
    "        input_ids = input_ids[:MAX_LENGTH]\n",
    "        attention_mask = attention_mask[:MAX_LENGTH]\n",
    "        labels = labels[:MAX_LENGTH]\n",
    "\n",
    "    \n",
    "    return {\n",
    "        \"input_ids\": input_ids,\n",
    "        \"attention_mask\": attention_mask,\n",
    "        \"labels\": labels\n",
    "    }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "0f4c1886",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Parameter 'function'=<function process_func at 0x7fefd48c6b00> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4d5fd61a7fee4a0e85f717df368aa334",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Map:   0%|          | 0/14041 [00:00<?, ? examples/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Dataset({\n",
       "    features: ['input_ids', 'attention_mask', 'labels'],\n",
       "    num_rows: 14041\n",
       "})"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized_ds = ds.map(process_func, remove_columns=ds.column_names)\n",
    "tokenized_ds "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ea3583a",
   "metadata": {},
   "source": [
    "### 配置QLoRA"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5815525df43b36f2",
   "metadata": {},
   "source": [
    "- QLoRA是一种高效的大模型微调策略\n",
    "- 在微调时，我们可以选择对模型的部分参数进行微调\n",
    "- 通过配置`LoraConfig`，我们可以指定微调的目标模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "2c622561",
   "metadata": {},
   "outputs": [],
   "source": [
    "from peft import LoraConfig, TaskType, get_peft_model\n",
    "config = LoraConfig(task_type=TaskType.CAUSAL_LM, target_modules=[\"wqkv\", \"w1\", \"w2\", \"w3\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "c11fa72b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, inference_mode=False, r=8, target_modules=['wqkv', 'w1', 'w2', 'w3'], lora_alpha=8, lora_dropout=0.0, fan_in_fan_out=False, bias='none', modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None)\n"
     ]
    }
   ],
   "source": [
    "# 查看QLoRA配置文件\n",
    "print(config)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "07fb8f68",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = get_peft_model(model, config)  # 根据配置文件在原模型的基础上添加QLoRA模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "1e3292d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.enable_input_require_grads()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "490a71f7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "trainable params: 7,077,888 || all params: 1,896,187,904 || trainable%: 0.37326933607525004\n"
     ]
    }
   ],
   "source": [
    "model.print_trainable_parameters()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74f7e637",
   "metadata": {},
   "source": [
    " trainable params ：表示参与训练的参数量\n",
    " all params ： 模型的整个参数量\n",
    " trainable ： 参与训练的参数占整个参数的比例"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7d78f40",
   "metadata": {},
   "source": [
    "### 配置训练参数"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "921d077eb0694af5",
   "metadata": {},
   "source": [
    "- 我们可以通过`TrainingArguments`来配置训练参数\n",
    "- output_dir: QLoRA模型保存路径\n",
    "- per_device_train_batch_size: 每个设备的训练batch_size\n",
    "- gradient_accumulation_steps: 梯度累积步数\n",
    "- logging_steps: 日志输出步数\n",
    "- num_train_epochs: 训练轮数\n",
    "- learning_rate: 学习率\n",
    "- gradient_checkpointing: 梯度检查点\n",
    "- optim: 优化器"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "6e735e73",
   "metadata": {},
   "outputs": [],
   "source": [
    "args = TrainingArguments(\n",
    "    output_dir=\"./chatbot\",\n",
    "    per_device_train_batch_size=1,\n",
    "    gradient_accumulation_steps=16,\n",
    "    logging_steps=10,\n",
    "    num_train_epochs=1,\n",
    "    learning_rate=3e-4,\n",
    "    gradient_checkpointing=True,\n",
    "    optim=\"paged_adamw_32bit\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b1c155e",
   "metadata": {},
   "source": [
    "### 创建训练器"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b52b744",
   "metadata": {},
   "source": [
    "- model ： 加载了QLoRA适配器的model\n",
    "- args : 训练的参数配置\n",
    "- train_dataset ： 处理后的训练数据集\n",
    "- data_collator ： 数据集规整器"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "bf7735cc",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.\n"
     ]
    }
   ],
   "source": [
    "trainer = Trainer(\n",
    "    model=model,\n",
    "    args=args,\n",
    "    train_dataset=tokenized_ds,\n",
    "    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51197e20",
   "metadata": {},
   "source": [
    "### 模型训练"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfcae7f07281f996",
   "metadata": {},
   "source": [
    "- 完成以上步骤后，我们可以开始训练模型\n",
    "- 本项目所有的训练脚本都存储在`xtuner_config`文件夹下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e8e069e",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.train()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9404be5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.save_model()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f571a31",
   "metadata": {},
   "source": [
    "### 模型评估"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3528d3ca7e4e1f06",
   "metadata": {},
   "source": [
    "- 同时加载训练好的QLoRA模型和原始模型进行推理评估"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "12504ffe",
   "metadata": {},
   "outputs": [],
   "source": [
    "from peft import PeftModel\n",
    "p_model = PeftModel.from_pretrained(model, model_id=\"./chatbot/checkpoint-876\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "8cdca4e7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "我很理解你对心理健康的关注。心理健康就像是一棵茂盛的树，需要我们细心呵护。就像树的根系一样，心理健康需要我们的积极思考、乐观的心态和有效的应对压力的能力。\n",
      "\n",
      "有时候，我们可能会感到心理不健康，这可能是因为我们面临着一些挑战和压力。就像在暴风雨中，树可能会摇摇欲坠，但是只要我们坚守，它就会恢复健康。所以，当你感到心理不健康时，不要独自承受，寻求专业的帮助和支持，与他人分享你的感受，他们会给予你更多的力量和支持。\n",
      "\n",
      "同时，我们也可以通过一些方法来改善心理健康。就像树需要阳光和水分一样，我们也需要找到适合自己的方式来放松自己。这可以包括运动、冥想、艺术创作、与朋友交流等等。这些活动可以帮助我们减轻压力，提升情绪，让我们重新获得内心的平静和健康。\n",
      "\n",
      "最后，记住心理健康是一个长期的过程，而不是一蹴而就的结果。就像树需要时间成长一样，我们也需要耐心和持续的努力来改善自己的心理健康。如果你感觉心理不健康，不要犹豫寻求帮助，并相信自己可以重新获得内心的平衡和健康。\n"
     ]
    }
   ],
   "source": [
    "p_model.eval()\n",
    "print(p_model.chat(tokenizer, \"我感觉我心理不健康\", history=[])[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9abd5eb",
   "metadata": {},
   "source": [
    "## RAG"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0c197bcc7ee801d",
   "metadata": {},
   "source": [
    "- 本项目的RAG实现存储在`rag`文件夹下\n",
    "- 这里只做最基本的RAG实现\n",
    "- RAG的数据采用QA对形式，具体的数据生成方法请参考`scripts/qa_generation/README.md`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db79d017",
   "metadata": {},
   "source": [
    "### 环境安装"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "619674be",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ! pip install numpy\n",
    "# ! pip install transformers\n",
    "# ! pip install BCEmbedding\n",
    "# ! pip install faiss-gpu\n",
    "# ! pip install peft"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d02c472e",
   "metadata": {},
   "source": [
    "### 加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b21deb8dd40eb887",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-09T01:16:19.269923Z",
     "start_time": "2024-04-09T01:16:19.248902Z"
    }
   },
   "outputs": [],
   "source": [
    "import json\n",
    "import numpy as np\n",
    "def data_process(dir):\n",
    "    \"\"\"读取jsonl文件，返回问题和答案的列表\"\"\"\n",
    "    with open(dir, 'r', encoding='utf-8') as f:\n",
    "        lines = f.readlines()   \n",
    "        lines = [json.loads(line) for line in lines]\n",
    "        questions = np.array([line['question'] for line in lines])\n",
    "        answers = np.array([line['answer'] for line in lines])\n",
    "        f.close()\n",
    "    return questions, answers\n",
    "\n",
    "questions, answers = data_process('./test.jsonl') # 这里传入你的数据集路径"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7e025356",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "《儿童发展心理学》的主要内容是什么？\n",
      "《儿童发展心理学》主要探讨0一18岁个体心理发展的特征及影响心理发展的相关问题，涵盖有关儿童心理发展的对象与任务、儿童心理发展的研究方法与设计、心理发展的基本理论、影响心理发展的遗传与环境因素，以及儿童在认知、智力、语言、情绪、人格、道德等各个领域的发展特点、趋势及影响因素等。\n"
     ]
    }
   ],
   "source": [
    "print(questions[0])\n",
    "print(answers[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ba4df6a",
   "metadata": {},
   "source": [
    "### 加载模型"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "90dd215dbd17148a",
   "metadata": {},
   "source": [
    "- 加载文本向量化模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "3f4e7fb1776aaca2",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-09T01:16:31.562061Z",
     "start_time": "2024-04-09T01:16:23.195224Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/407/anaconda3/envs/llm/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()\n",
      "  return self.fget.__get__(instance, owner)()\n",
      "04/14/2024 12:03:46 - [INFO] -BCEmbedding.models.EmbeddingModel->>>    Loading from `/home/407/bigfile/tangpeng/bce-embedding-base_v1`.\n",
      "04/14/2024 12:03:47 - [INFO] -BCEmbedding.models.EmbeddingModel->>>    Execute device: cuda;\t gpu num: 4;\t use fp16: False;\t embedding pooling type: cls;\t trust remote code: True\n"
     ]
    }
   ],
   "source": [
    "from BCEmbedding import EmbeddingModel\n",
    "embedding_model = EmbeddingModel(model_name_or_path=\"./bce-embedding-base_v1\", trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "320bea7354ed3c37",
   "metadata": {},
   "source": [
    "### 对RAG数据库中的问题进行向量化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "63c2349e1a76aee5",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-09T01:16:46.665339Z",
     "start_time": "2024-04-09T01:16:34.731194Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 6/6 [00:08<00:00,  1.40s/it]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2948, 768)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from tqdm import tqdm\n",
    "questions = questions.tolist()\n",
    "vectors = []\n",
    "for i in tqdm(range(0, len(questions), 512)):\n",
    "    batch_sens = questions[i: i + 512]\n",
    "    batch_inputs = embedding_model.encode(batch_sens, enable_tqdm=False)\n",
    "    vectors.append(batch_inputs)\n",
    "\n",
    "vectors = np.concatenate(vectors, axis=0)\n",
    "print(vectors.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b4b26a8",
   "metadata": {},
   "source": [
    "2948 表示QA数据集大小，有2948个QA对\n",
    "768 表示将一个问题转化为一个768维的向量表示"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0806c3c2",
   "metadata": {},
   "source": [
    "### 建立向量索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "754918991333a471",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-08T14:20:47.166388Z",
     "start_time": "2024-04-08T14:20:47.151423Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "04/14/2024 12:09:03 - [INFO] -faiss.loader->>>    Loading faiss with AVX2 support.\n",
      "04/14/2024 12:09:03 - [INFO] -faiss.loader->>>    Could not load library with AVX2 support due to:\n",
      "ModuleNotFoundError(\"No module named 'faiss.swigfaiss_avx2'\")\n",
      "04/14/2024 12:09:03 - [INFO] -faiss.loader->>>    Loading faiss.\n",
      "04/14/2024 12:09:03 - [INFO] -faiss.loader->>>    Successfully loaded faiss.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading index\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<faiss.swigfaiss.IndexFlat; proxy of <Swig Object of type 'faiss::IndexFlat *' at 0x7f608c1e9d40> >"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 将向量保存到向量数据库中\n",
    "import faiss, os\n",
    "if not os.path.exists(\"large.index\"):\n",
    "    print(\"Building index\")\n",
    "    index = faiss.IndexFlatIP(768)              # 创建索引，768是向量的维度\n",
    "    faiss.normalize_L2(vectors)                 # 对向量进行归一化\n",
    "    index.add(vectors)                          # 向索引中添加向量\n",
    "    faiss.write_index(index, \"large.index\")     # 将索引保存到文件中\n",
    "else:\n",
    "    print(\"Loading index\")\n",
    "    index = faiss.read_index(\"large.index\")     # 从文件中加载索引\n",
    "index"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d63bdd12",
   "metadata": {},
   "source": [
    "### 对查询进行向量化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "fb22e7b337753a93",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-08T14:20:49.738883Z",
     "start_time": "2024-04-08T14:20:49.644729Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1, 768)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quesiton = \"最近好烦，有很多事情要做！\"\n",
    "q_vector = embedding_model.encode(quesiton, enable_tqdm=False)\n",
    "q_vector.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de0df8a4",
   "metadata": {},
   "source": [
    "### 召回TOP K"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f57458b660804a4",
   "metadata": {},
   "source": [
    "- 根据question向量，从数据库中召回与之最相似的TOP K"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6cc79a4ab8efa483",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-08T14:30:32.425395Z",
     "start_time": "2024-04-08T14:30:32.408161Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'安慰商场里和妈妈走散的小妹妹': 2278,\n",
       " '小学生常用什么方式来表达自己的心情？': 163,\n",
       " '情绪表达的方式有哪些？': 2428,\n",
       " '小学生不良情绪的表现有哪些？': 210,\n",
       " '2岁儿童主要使用哪些情绪调节策略？': 162,\n",
       " '情绪调节有哪些类型？': 116,\n",
       " '青少年通常如何排解性的压力或宣泄内心焦虑与不安？': 1064,\n",
       " '情绪调节可以分为哪三大类？': 139,\n",
       " '青少年通常如何排解性的压力？': 1065,\n",
       " '儿童的基本情绪主要包括哪些？': 2565}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "faiss.normalize_L2(q_vector)                     \n",
    "scores, indexes = index.search(q_vector, 10)     \n",
    "topk_result = np.array(questions)[indexes]       \n",
    "topk_result = topk_result.tolist()[0]\n",
    "indexes = indexes.tolist()[0]\n",
    "\n",
    "question2index = {}\n",
    "for question, index in zip(topk_result, indexes):\n",
    "    question2index[question] = index\n",
    "    \n",
    "question2index   # 返回的是top10的问题和对应的索引"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b9b579b",
   "metadata": {},
   "source": [
    "### 召回进行排序"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2dead9826c88c794",
   "metadata": {},
   "source": [
    "- 使用RerankerModel对召回的结果进行排序，选出最终的TOP 3作为问题回答的上下文"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d54c1846f3deca4b",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-04-08T14:34:20.633138Z",
     "start_time": "2024-04-08T14:34:17.794869Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "04/14/2024 12:15:05 - [INFO] -BCEmbedding.models.RerankerModel->>>    Loading from `/home/407/bigfile/tangpeng/bce-reranker-base_v1`.\n",
      "04/14/2024 12:15:05 - [INFO] -BCEmbedding.models.RerankerModel->>>    Execute device: cuda;\t gpu num: 4;\t use fp16: False\n",
      "You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'亲社会行为情绪表达的方式主要有言语表情和非言语表情两大类。其中言语表情是通过一个人言语时的音响、音速、音调等变化来反映不同情绪的。而非言语表情又包括面部表情和体态表情两方面。从情绪调节过程的来源划分，可以将情绪调节分为内部调节和外部调节两大类。其中，内部调节主要由个体自身完成，包括对神经生理、认知体验和动作行为的调节。而外部调节则来源于个体以外的环境，如幼儿痛哭时，大人的抚慰可以帮助幼儿尽快地从痛苦中走出来。外部调节可以分为支持性环境调节和破坏性环境调节两大类。'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from BCEmbedding import RerankerModel\n",
    "canidate = topk_result\n",
    "sentence_pairs = [[quesiton, passage] for passage in canidate]\n",
    "reranker_model = RerankerModel(model_name_or_path=\"./bce-reranker-base_v1\", trust_remote_code=True)\n",
    "# scores = model.compute_score(sentence_pairs)\n",
    "rerank_results = reranker_model.rerank(quesiton, canidate)\n",
    "\n",
    "context = \"\"\n",
    "for i in range(3):\n",
    "    context += answers[question2index[rerank_results[\"rerank_passages\"][i]]]\n",
    "\n",
    "context   # 返回的是利用与查询问题最相似的top3问题对应的答案构建的上下文"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a67b35d9",
   "metadata": {},
   "source": [
    "### 生成回复"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "18c1e1a3",
   "metadata": {
    "ExecuteTime": {
     "start_time": "2024-04-08T14:38:58.501305Z"
    },
    "jupyter": {
     "is_executing": true
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "fe9c043fad0e44f490e1d898a8fe96cd",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "你好，我是书生·浦语。\n",
      "从你的描述来看，你可能需要处理一些比较繁重和复杂的事情，需要你的注意力和精力比较集中，可能也会遇到一些情绪上的困扰，比如烦躁不安等。\n",
      "你可以尝试给自己列一个清单，列出你需要做的事情，按照优先级安排，这样可以帮助你更好地处理这些事情。\n",
      "另外，你也可以试着去运动，运动可以释放身体内的压力，缓解情绪上的压力。\n",
      "希望我的回答能够帮助到你。\n",
      "祝好（🐳）\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
    "\n",
    "model_dir = \"./Shanghai_AI_Laboratory/internlm2-chat-1_8b\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)\n",
    "model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, device_map=\"auto\")\n",
    "\n",
    "# 利用自己微调后的模型进行对话\n",
    "from peft import PeftModel\n",
    "model = PeftModel.from_pretrained(model, model_id=\"./chatbot/checkpoint-876\")\n",
    "\n",
    "model = model.eval()\n",
    "template = \"\"\"使用以下上下文来回答最后的问题。如果你不知道答案，就说你不知道，不要试图编造答案。尽量使答案简明扼要。总是在回答的最后说“谢谢你的提问！”。\\n上下文：\"\"\" +  context + \"\\n最后的问题: \" + quesiton + \"\\n有用的回答:\"\n",
    "response, history = model.chat(tokenizer, template, history=[])\n",
    "\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbfd9227ff1f07c6",
   "metadata": {},
   "source": [
    "- 用作对比，在没有上下文的情况下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "757daa05ba1750d0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "你好，我是小九。\n",
      "最近我也是有很多事情要做，不过我会尝试去分配任务的优先级，先完成最重要的。\n",
      "有时候我们会遇到一些紧急但并不那么重要的任务，可以先放在后面完成。\n",
      "当你觉得有些烦躁时，可以先让自己冷静下来，再思考如何安排自己的时间。\n",
      "祝好！\n",
      "\n"
     ]
    }
   ],
   "source": [
    "template =  quesiton \n",
    "response, history = model.chat(tokenizer, template, history=[])\n",
    "print(response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
